To start our preliminary analysis of the data, we plot how common each glitch class is relative to the others in each interferometer:

First, we can see that several glitch classes appear in only one interferometer (although some may simply have too few examples in the other to show up in this graph). We can also see that Hanford generally has more glitches than Livingston, and that Blips, Koi Fish, and Low-Frequency Bursts appear to be the classes with the most examples. To check whether this is accurate, we now calculate summary statistics for each glitch class: the number of glitches of that class detected in each interferometer (columns n_H1 and n_L1) and the mean of each predictor variable for the class.

## # A tibble: 22 × 8
##    label             n_H1  n_L1    snr peak_freq central_freq duration bandwidth
##    <chr>            <dbl> <dbl>  <dbl>     <dbl>        <dbl>    <dbl>     <dbl>
##  1 1080Lines          327     1   10.2      1111         2961     0.85      4730
##  2 1400Ripples          0    81   10.9      1527         1846     0.15      1654
##  3 Air_Compressor      55     3    8.7        48          320     0.41       567
##  4 Blip              1453   368   22.8       199          839     0.27      1595
##  5 Chirp               28    32   13.6       141          264     0.29       461
##  6 Extremely_Loud     266   181 2416.        140         2673     8.17      5311
##  7 Helix                3   276    8.8       134          263     0.09       326
##  8 Koi_Fish           517   189  139.        157         1834     1.75      3629
##  9 Light_Modulation   511     1   34.6       105         2000     2.34      3966
## 10 Low_Frequency_B…   166   455   29.9        16         2611     2.91      5208
## 11 Low_Frequency_L…    79   368   23.1        12         2630     3.94      5243
## 12 No_Glitch           91    59    9.3       183         1601     1.95      2915
## 13 None_of_the_Abo…    51    30   45.3       170         1744     2.72      3436
## 14 Paired_Doves        27     0   33.4        41         1270     0.42      2505
## 15 Power_Line         273   176   11.3        62          733     0.75      1367
## 16 Repeating_Blips    230    33   29.2       200         1650     0.31      3214
## 17 Scattered_Light    385    58   16.4        30         2175     2.61      4319
## 18 Scratchy            90   247    8.6       153         1223     1.45      2269
## 19 Tomte               61    42   16.2        47          833     0.73      1622
## 20 Violin_Mode        141   271   13.4      1673         1742     0.29      2637
## 21 Wandering_Line      42     0   27.8       667         2127     6.05      3929
## 22 Whistle              2   297    9.5      1093         2690     0.59      4788
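
A summary of this shape can be reproduced with a simple group-and-aggregate step. The following is a minimal sketch in Python rather than the code used for the table above: the rows are made-up toy values, the field names `label`, `ifo`, `snr`, and `duration` are assumed to match the dataset's columns, and the means are assumed to be pooled over both interferometers.

```python
from collections import defaultdict
from statistics import mean

# Toy rows for illustration only; these values are invented, not taken from the dataset.
glitches = [
    {"label": "Blip", "ifo": "H1", "snr": 20.0, "duration": 0.25},
    {"label": "Blip", "ifo": "L1", "snr": 24.0, "duration": 0.75},
    {"label": "Koi_Fish", "ifo": "H1", "snr": 140.0, "duration": 1.5},
]


def summarise(rows, predictors):
    """Per label: glitch count in each interferometer and the mean of each predictor."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["label"]].append(row)

    summary = {}
    for label, members in sorted(groups.items()):
        stats = {
            "n_H1": sum(r["ifo"] == "H1" for r in members),
            "n_L1": sum(r["ifo"] == "L1" for r in members),
        }
        for name in predictors:
            # Assumption: means are taken over both interferometers together.
            stats[name] = mean(r[name] for r in members)
        summary[label] = stats
    return summary


summary = summarise(glitches, ["snr", "duration"])
for label, stats in summary.items():
    print(label, stats)
```

Each entry of `summary` corresponds to one row of the table: two counts followed by one mean per predictor.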

From this summary data, we can see that the machine-learning (ML) classifier has placed a few glitches in classes that otherwise belong almost entirely to the other observatory (for example, a single 1080Lines glitch at Livingston and three Helix glitches at Hanford). There are two possible explanations. First, enough glitches happen that, even if a glitch type isn’t really present in one interferometer, a burst of noise with a random shape may fit none of that interferometer’s usual categories well and happen to resemble a class from the other one, so it is classified as that. Alternatively, the presence of a “None of the Above” category suggests that the training data (classified by citizen scientists) is included in this dataset, in which case these mistakes may be human errors already present in the training data. We can also see that Koi Fish, one of the most common glitch types, is the loudest standard class (other than Extremely Loud glitches, which are the loudest by definition), while Scratchy, Air Compressor, and Helix glitches are the quietest, with mean SNRs of 8.6, 8.7, and 8.8, all below the 9.3 of the No_Glitch category.
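
To make the “wrong observatory” cases concrete, the counts from the summary table can be screened for classes that are almost, but not entirely, confined to one interferometer. This is only a sketch in Python: the counts are copied from a subset of the table rows, while the 10% cut-off and the function name `stray_classes` are arbitrary choices made for illustration.

```python
# (n_H1, n_L1) pairs copied from a subset of the summary table above.
counts = {
    "1080Lines": (327, 1),
    "1400Ripples": (0, 81),
    "Air_Compressor": (55, 3),
    "Blip": (1453, 368),
    "Helix": (3, 276),
    "Light_Modulation": (511, 1),
    "Paired_Doves": (27, 0),
    "Scattered_Light": (385, 58),
    "Whistle": (2, 297),
}


def stray_classes(counts, max_minority_share=0.1):
    """Classes with a small but non-zero share of glitches in the 'other' interferometer.

    The 10% default threshold is an illustrative assumption, not a value from the analysis.
    """
    flagged = {}
    for label, (n_h1, n_l1) in counts.items():
        minority = min(n_h1, n_l1)
        total = n_h1 + n_l1
        if minority > 0 and minority / total < max_minority_share:
            flagged[label] = minority
    return flagged


flagged = stray_classes(counts)
print(flagged)
```

Classes with no examples at all in one interferometer (such as 1400Ripples and Paired_Doves) are deliberately left out, since they contain no candidate misclassifications.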

Preliminary Analysis

To begin our analysis of the individual data points (rather than just the averages for each glitch class), we plot bandwidth against duration, with the colour of each point representing its label, to see how well we can group glitch classes by the general dimensions of the signal:

This image is a bit hard to read, since the labels take up so much room, and since there are so many data points on the graph; still, one can tell that in the lower part of the distribution the glitch population is dominated by the blue and purple colours of Koi Fish, Low-Frequency Bursts, and Low-Frequency Lines, with a sudden stripe of green (and assorted other colours) at the very bottom. Zooming in on the y-axis to glitches with durations less than 5 seconds gives us the following plot:

With this zoomed-in visualisation, we can see that the data has some artefacts that cause the values to line up on a grid. Ignoring this, we notice the large cluster of green Blip glitches at the bottom, especially prominent in the bottom-left corner, as well as two major populations of 1080Lines glitches: one forming linear patterns in the lower-right corner of the distribution, and the other around the left-hand side of the Blip distribution. We also see a population of either Violin Modes or Wandering Lines in the centre of the lower edge of the plot, with its density peaking at bandwidths between 3000 and 4000. We can now also see many Koi Fish glitches spread through the background distribution, which we could not tell apart from the similarly coloured Low-Frequency Bursts and Lines in the previous plot. Finally, Koi Fish and Low-Frequency Bursts seem to dominate most of the chart, with assorted Scattered Light glitches among them.

Meanwhile, if we instead plot peak frequency by duration (using the same limit on duration), we get the following graph:

Here, we can clearly see spikes of 1080 Hz lines and 1400 Hz ripples at their respective frequencies, as well as several spikes of Violin Modes at frequencies just over 1000 Hz, 1500 Hz, and 2000 Hz, against a background composed mostly of Whistles. Moving to lower frequencies, we see a cloud of Blips underneath another blob composed mostly of Koi Fish. There are several distributions of other glitches at lower frequencies as well, but these are harder to see clearly because of how little room they take up on the graph. To solve this, we use the same transformation as the Gravity Spy spectrograms: taking the logarithm of the frequency values.
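
The effect of the logarithmic axis can be quantified with a small calculation. The sketch below, in Python, compares how much of the frequency axis the band from 10 Hz to 100 Hz occupies on a linear scale and on a log scale; the axis limits of 10 Hz and 2000 Hz are assumed for illustration and are not read off the actual plots.

```python
import math


def axis_fraction(lo, hi, axis_min, axis_max, log=False):
    """Fraction of the axis length occupied by the band [lo, hi]."""
    scale = math.log10 if log else (lambda x: x)
    return (scale(hi) - scale(lo)) / (scale(axis_max) - scale(axis_min))


# Assumed axis limits in Hz (illustrative, not taken from the plots).
AXIS_MIN_HZ, AXIS_MAX_HZ = 10, 2000

linear_share = axis_fraction(10, 100, AXIS_MIN_HZ, AXIS_MAX_HZ)
log_share = axis_fraction(10, 100, AXIS_MIN_HZ, AXIS_MAX_HZ, log=True)
print(f"linear axis: {linear_share:.1%}, log axis: {log_share:.1%}")
```

Under these assumed limits, the 10 Hz to 100 Hz band grows from under 5% of the axis to over 40% of it, which is why the low-frequency glitch classes become distinguishable after the transformation.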

In this new graph, we can still see the high-frequency spikes, but we can now also see many lower-frequency trends (as well as gridline textures similar to those in the previous diagrams). One of the most striking, in my opinion, is the line at 20 Hz that separates the Low-Frequency Lines and Bursts from the Scattered Light glitches. There is a similar line on the other side of the Scattered Light glitches which separates them from most other glitches (although there is a small area outside this line where Scattered Light glitches are mixed with other glitch types, surprisingly still bounded by vertical lines). I have outlined these areas in the following plot:

Moving back to the unmarked graph, we can see the 60 Hz Power Line glitches as a line of orange points around 60 Hz, and the Air Compressor glitches as a similar yellow line at around 45 Hz. We also see that Blips are mostly found in a triangle from 40 Hz to 700 Hz with durations of less than 1 second, with a cluster of Helix glitches in its centre. The area above this triangle has a background patterned with the dark blues of Koi Fish and Light Modulation, the cyan of Scratchy glitches, and the pale blue of Tomtes. Restricting our graph to these glitch types to get a better look at their distributions, we get the following graph:

Here, we can tell that Light Modulation has data points all across the graph, from around (1000, 0) to around (11, 5), while Tomtes are fairly localised between (32, 0) and (64, 1.3), with few exceptions. Helix glitches are indeed clustered in the centre of the Blip cluster, which is in most cases visibly separate from the Koi Fish cluster, while Scratchy glitches are found throughout both.

Next, we plot SNR against duration and against peak frequency, and analyse the resulting graphs:

From the first plot, we can see that, while Blips, Koi Fish, and Extremely Loud glitches form visually distinct categories, most of the other glitch classes share a single region. The second plot (essentially a zoomed-out version of the first) emphasises these four distributions (the three distinct classes and the ‘other’ group), while also showing a notable second group of Extremely Loud glitches that intersects the ‘other’ distribution. The third plot, however, is arguably the one on which the glitch classes are easiest to tell apart so far. Scratchy glitches have a nearly distinct region (slightly overlapping with Helix and Blips), as do Scattered Light glitches (which, surprisingly, overlap most with Tomtes), and most of the clusters from the peak frequency-duration plot are visible too, although the other low-frequency glitches are mixed together more than in that earlier graph. To see whether we can combine all three of these variables into a single plot, we create a 3D visualisation of the data:

This interactive 3D chart is especially useful here, since we can isolate combinations of glitch classes and see the differences between them. For example, Blips and Koi Fish can be seen not only to have a nearly quadratic boundary in the peak_frequency-snr plane, but also to differ in duration: Koi Fish generally last longer than Blips, sometimes dramatically so.

We also create charts of these three variables plotted against amplitude.

Here, we won’t note specific glitch distributions, but one may notice that the amplitude-to-SNR graph has a linear lower boundary, as well as several diagonal lines running parallel to that boundary. That there is a relation here is unsurprising, as the SNR calculation includes the signal’s strength; however, the exact relation is unclear. In addition, the amplitude-to-peak-frequency graph has a lower boundary with a shape familiar to anyone analysing gravitational-wave data: the amplitude spectral density (ASD) against frequency curve of a detector (specifically, the LIGO O1 detectors).

ASD-frequency graph (image from the Gravitational Wave Open Science Center: https://gwosc.org/o1speclines/)

That the glitches all occur above this line is unsurprising, as the curve gives the minimum amplitude necessary at each frequency for a signal to be detected; glitches must therefore have amplitudes higher than this curve to be directly detected.

Creating a ‘model’ for the data

Next, we manually create a ‘model’ that attempts to predict the glitch type from these variables, as well as ifo, amplitude, phase, and central_frequency. Unlike an actual model, it will not predict all classes simultaneously; instead, we build by hand a sequence of elimination rules, each of which picks out one population of glitches.

To begin, we create a new dataframe from which we remove the two classes that consist of multiple bliplets and resemble other glitch classes (Repeating Blips and Light Modulation). We then add a new variable, predicted_label, for the predicted glitch class, and classify all glitches with SNRs higher than 500 as Extremely Loud, since this includes all of the high-peak-frequency population of Extremely Loud glitches, as well as much of the lower-frequency population, while excluding as many other glitches as possible.

Next, we attempt to select the second population of Extremely_Loud glitches by noting that they have durations greater than 4 s, peak frequencies between 18 Hz and 55 Hz, and SNRs over 30.
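
The first two elimination rules can be sketched as follows. This is an illustrative Python version with invented toy rows, not the code used in the analysis: the field names `snr`, `duration`, and `peak_freq` follow the summary table, the helper names are hypothetical, and treating the 18 Hz and 55 Hz bounds as inclusive is an assumption.

```python
# Classes dropped before the rules are applied.
EXCLUDED = {"Repeating_Blips", "Light_Modulation"}


def predict_label(glitch):
    """Apply the two Extremely_Loud rules; return None if neither matches."""
    # Rule 1: very high SNR.
    if glitch["snr"] > 500:
        return "Extremely_Loud"
    # Rule 2: long, low-frequency, moderately loud glitches.
    # Assumption: the 18-55 Hz bounds are treated as inclusive.
    if glitch["duration"] > 4 and 18 <= glitch["peak_freq"] <= 55 and glitch["snr"] > 30:
        return "Extremely_Loud"
    return None  # left for later rules


def build_model_frame(rows):
    """Drop the excluded classes and add a predicted_label field to each remaining row."""
    kept = [dict(row) for row in rows if row["label"] not in EXCLUDED]
    for row in kept:
        row["predicted_label"] = predict_label(row)
    return kept


# Toy rows for illustration only; these values are invented.
toy = [
    {"label": "Extremely_Loud", "snr": 2400.0, "duration": 8.0, "peak_freq": 140.0},
    {"label": "Extremely_Loud", "snr": 45.0, "duration": 6.0, "peak_freq": 30.0},
    {"label": "Blip", "snr": 22.0, "duration": 0.3, "peak_freq": 200.0},
    {"label": "Light_Modulation", "snr": 600.0, "duration": 2.0, "peak_freq": 100.0},
]

model_frame = build_model_frame(toy)
for row in model_frame:
    print(row["label"], "->", row["predicted_label"])
```

Returning `None` for unmatched rows keeps them available for later rules, which matches the elimination approach described above: each rule claims one population and leaves the rest untouched.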